Abstract

In mixtures-of-experts (ME) models, where a number of submodels (experts) are
combined, there have been two longstanding problems: (i) how many experts
should be chosen, given the size of the training data? (ii) given the total number
of parameters, is it better to use a few very complex experts, or is it better to
combine many simple experts? In this paper, we try to provide some insights into these
problems through a theoretical study of an ME structure where m experts are mixed,
with each expert being related to a polynomial regression model of order k. We
study the convergence rate of the maximum likelihood estimator (MLE), in terms
of how fast the Kullback-Leibler divergence between the true density and the estimated
density converges to zero as the sample size n increases. The convergence rate is found
to depend on both m and k, and certain choices of m and k are found to
produce optimal convergence rates. Therefore, these results shed light on the two
aforementioned important problems: how to choose m, and how m and k
should be traded off against each other to achieve good convergence rates.

Keywords: Convergence Rate, Approximation Rate, Nonparametric Regression, Exponential 
Family, Hierarchical Mixture-of-Experts, Mixture-of-Experts, Maximum Likelihood estimation 

1 Introduction 

Mixture-of-experts models (ME) [Jacobs et al., 1991] and hierarchical mixture-of-experts
models (HME) [Jordan and Jacobs, 1994] are powerful tools for estimating the
density of a random variable Y conditional on a known set of covariates X. The idea
is to "divide and conquer". We first divide the covariate space into subspaces, then
approximate each subspace by an adequate model and, finally, weigh by the
probability that X falls in each subspace. Additionally, it can be seen as a
generalization of the classical mixture of models, whose weights are constant across the
covariate space. Mixtures-of-experts have been widely used in a variety of fields,
including image recognition and classification, medicine, audio classification and finance.
Such flexibility has also inspired a series of distinct models including Wood et al.
[2002], Carvalho and Tanner [2005a], Geweke and Keane [2007], Wood et al. [2008],
Villani et al. [2009], Young and Hunter [2010] and Wood et al. [2011], among many
others.

We consider a framework similar to Jiang and Tanner [1999a], among others.
Assume each expert is in a one-parameter exponential family with mean $\varphi(h_k)$, where $h_k$
is a $k$th-degree polynomial in the conditioning variables X (hence a linear function of
the parameters) and $\varphi(\cdot)$ is the inverse link function. In other words, each expert is
a generalized linear model on a one-dimensional exponential family (GLM1). We
allow the target density to be in the same family of distributions, but with conditional
mean $\varphi(h)$ with $h \in \mathcal{W}^{\alpha}_{\infty,K_0}$, a Sobolev class with $\alpha$ derivatives. Some examples of
target densities include the Poisson, binomial, Bernoulli and exponential distributions
with unknown mean. Normal, gamma and beta distributions also fall in this class if the
dispersion parameter is known.

One might be reluctant to use (H)ME models with polynomial experts since this leads
to more and more complex models as the degree k of the polynomials increases. The
question of whether it is better to mix many simple models or fewer, more complex
models is not new in the mixture-of-experts literature. Earlier in the literature,
Jacobs et al. [1991] and Peng et al. [1996] proposed mixtures of many simple models;
more recently, Wood et al. [2002] and Villani et al. [2009] considered using only a few
complex models. Celeux et al. [2000] and Geweke [2007] advocate mixing fewer
complex models, claiming that mixture models can be very difficult to estimate and
interpret. We justify the use of such models through the approximation and estimation
errors. We illustrate that there might be a gain from a small increase of k compared to the
linear model k = 1, but the number of parameters increases exponentially as k increases.
Therefore, a balance between the complexity of the model and the number of experts is
required for achieving better error bounds.

This work extends Jiang and Tanner [1999a] in a few directions. We show that, by
including polynomial terms, one is able to improve the approximation rate on sufficiently
smooth classes. This rate is sharp for the piecewise polynomial approximation as shown 
in Windlund [1977]. Moreover, we contribute to the literature by providing rates of con- 
vergence of the maximum likelihood estimator to the true density. We emphasize that 
such rates have never been developed for this class of models and the method used can 
be straightforwardly generalized to more general classes of mixture of experts. Conver- 
gence of the estimated density function to the true density and parametric convergence 
of the quasi-maximum likelihood estimator to the pseudo-true parameter vector are also 
obtained. 

We find that, under slightly weaker conditions than Jiang and Tanner [1999a],
the approximation rate in Kullback-Leibler divergence is uniformly bounded by
$c\,m^{-2[\alpha\wedge(k+1)]/s}$, where c is some constant not depending on k or m, and s is the number of
independent variables. This is a generalization of the rate found in Jiang and Tanner
[1999a], who assume $\alpha = 2$ and $k = 1$. The convergence rate of the maximum
likelihood estimator to the true density is $O_p\left(m^{-2[\alpha\wedge(k+1)]/s} + (mJ_k + v_m)\,n^{-1}\log n\right)$,
where $J_k$ is the total number of parameters in each polynomial (typically $\binom{k+s}{k}$),
and $v_m$ is the number of parameters in the weight functions. To show the previous
results we do not assume identifiability of the model, as it is natural for mixture-of-experts
to be unidentifiable under permutation of the experts. If we further assume
identifiability [Jiang and Tanner, 1999a, Mendes et al., 2006], and that the likelihood function
has a unique maximizer, we are able to remove the $\log n$ term in the convergence
rate. Optimal nonparametric rates of convergence can be attained if $k = \alpha - 1$ and
$m = O(n^{s/(2\alpha+s)})$ [Chen, 2006, Stone, 1980, 1985].

Zeevi et al. [1998] show approximation in the $L_p$ norm and estimation error for the
conditional expectation of the ME with generalized linear experts. Jiang and Tanner
[1999a] show consistency and approximation rates for the HME with generalized linear
models as experts and a general specification for the gating functions. They consider the
target density to belong to the exponential family with one parameter. Their
approximation rate of the Kullback-Leibler divergence between the target density and the model
is $O(1/m^{4/s})$, where m is the number of experts and s the number of covariates. Norets
[2010] shows the approximation rate for the mixture of Gaussian experts where both
the variance and the mean can be nonlinear and the weights are given by multinomial
logistic functions. He considers the target density to be a smooth continuous function
and the dependent variable Y to be continuous and satisfy some moment conditions.
His approximation rate is $O(1/m^{s+2+1/(q-2)+\varepsilon})$, where Y is assumed to have at least
q moments and $\varepsilon$ is a small number. Despite these findings, there are no convergence
rates yet for the maximum likelihood estimator of mixture-of-experts type models in
the literature.

By studying the convergence rates in this paper, we will be able to shed light on
two long-standing problems in ME: (i) How to choose the number of experts m for a
given sample size n? (ii) Is it better to mix many simple experts or to mix a few complex
experts? None of the works discussed above directly addresses these questions. Our study
of an ME structure mixing m of the $k$th-order polynomial submodels is particularly useful
in studying problem (ii), which cannot be studied in the framework of Jiang and Tanner
[1999a], for example, who restrict attention to the special case k = 1.

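Throughout the paper we use the following notation. Let $x = (x_1, \dots, x_s) \in S$
and $h(x) : S \to \mathbb{R}$, and let $\lambda$ denote some measure. For any finite vector x we use
$|x| = \sum_{j=1}^{s} |x_j|$ and $|x|_p = \left(\sum_{j=1}^{s} |x_j|^p\right)^{1/p}$ for $p \in [1, \infty)$; if $p = \infty$ we take
$|x|_\infty = \sup_{1 \le j \le s} |x_j|$. For some function $h(x)$ and measure $\lambda$ we denote
$\|h\|_{p,S} = \left(\int_S |h|^p \, d\lambda\right)^{1/p}$ for $p \in [1, \infty)$, and for $p = \infty$ we have
$\|h\|_{\infty,S} = \operatorname{ess\,sup}_{x \in S} |h(x)|$.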

The remainder of the paper is organized as follows. In the next section we introduce 
the target density and mixture of experts models. We also demonstrate that the quasi- 
maximum likelihood estimator converges to the pseudo-true parameter vector. Section 
3 establishes the main results of the paper: approximation rate, convergence rate and 
non-parametric consistency. Section 4 discusses model specification and the tradeoff 
that we unveil between the number of experts and the degree of the polynomials. In the 
concluding remarks we compare our results with Jiang and Tanner [1999a] and provide 
direction for future research. The appendix collects technical details of the paper and a 
deeper treatment on how to bound the estimation error. 

2 Preliminaries



In this section we introduce the target class of densities, the mixture-of-experts model
with GLM1 experts, and the estimation algorithm.

2.1 Target density 

Consider a sequence of random vectors $\{(X_i', Y_i)'\}_{i=1}^{n}$ defined on
$((\Omega \times A)^n, \mathcal{B}_{(\Omega \times A)^n}, P_{xy})$, where $X \in \Omega \subset \mathbb{R}^s$, $Y \in A \subseteq \mathbb{R}$ and $\mathcal{B}_S$ is the Borel
$\sigma$-algebra generated by the set S. We assume that $P_{xy}$ has a density $p_{xy} = p_{y|x}\,p_x$
with respect to some measure $\lambda$. More precisely, we assume that $p_x$ is known and $p_{y|x}$ is
a member of a one-dimensional exponential family, i.e.

$$p_{y|x} = \exp\{y\,a(h(x)) + b(h(x)) + c(y)\}, \qquad (1)$$

where $a(\cdot)$ and $b(\cdot)$ are known three times continuously differentiable functions, with
first derivative bounded away from zero, and $a(\cdot)$ has a non-negative second derivative;
$c(\cdot)$ is a known measurable function of Y. The function $h(\cdot)$ is a member of $\mathcal{W}^{\alpha}_{\infty,K_0}(\Omega)$,
a Sobolev class of order $\alpha$¹. Throughout the paper denote by $\Pi(\mathcal{W}^{\alpha}_{\infty,K_0})$ the class of
density functions $p_{xy} = p_{y|x}\,p_x$.

The one-parameter exponential family of distributions includes the Bernoulli,
exponential, Poisson and binomial distributions; it also includes the Gaussian, gamma and
Weibull distributions if the dispersion parameter is known. It is possible to extend the
results to the case where the dispersion parameter is unknown, but defined in a compact
subset bounded away from zero. In this work we focus only on the one-parameter case.

Some properties of the one-parameter exponential family are: (i) conditional on
$X = x$, the moment generating function of y exists in a neighborhood of the origin,
implying that moments of all orders exist; (ii) for each positive integer j,
$\mu_{(j)}(h) = \int_A y^j \exp[a(h)y + b(h) + c(y)]\,d\lambda$ is a differentiable function of h; and (iii) the first
conditional moment is $\mu_{(1)}(h) = -\dot{b}(h)/\dot{a}(h) = \varphi(h)$, where $\dot{a}(h)$ and $\dot{b}(h)$ are the first
derivatives of $a(h)$ and $b(h)$ respectively, and $\varphi(\cdot)$ is called the inverse link function.
See Lehmann [1991] and McCullagh and Nelder [1989] for more results about the
exponential family of distributions.

¹ Suppose $1 \le p \le \infty$ and $\alpha > 0$ is an integer. We define $\mathcal{W}^{\alpha}_{p,K_0}(\Omega)$ as the collection of measurable
functions h with all distributional derivatives $D^r h$, $|r| \le \alpha$, in $L^p(\Omega)$, i.e. $\|D^r h\|_{p,\Omega} \le K_0$. Here
$D^r = \partial^{|r|} / (\partial x_1^{r_1} \cdots \partial x_s^{r_s})$ and $|r| = r_1 + \cdots + r_s$ for $r = (r_1, \dots, r_s)$.
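
As a concrete illustration of the exponential-family form (1) and of property (iii), the following sketch writes the Poisson distribution in that form and checks numerically that its mean equals minus the ratio of the first derivatives of b and a. The parametrization a(h) = h, b(h) = -exp(h), c(y) = -log(y!) and all function names are assumptions made for this example only; they are not taken from the paper.

```python
import math

# Poisson in the form exp{y a(h) + b(h) + c(y)}, taking h as the log-mean.
# This particular parametrization is an assumption made for illustration.
def a(h):
    return h

def b(h):
    return -math.exp(h)

def c(y):
    return -math.lgamma(y + 1)  # equals -log(y!)

def density(y, h):
    return math.exp(y * a(h) + b(h) + c(y))

def inverse_link(h, eps=1e-6):
    """phi(h) = -b'(h) / a'(h), with both derivatives taken by central differences."""
    da = (a(h + eps) - a(h - eps)) / (2 * eps)
    db = (b(h + eps) - b(h - eps)) / (2 * eps)
    return -db / da

h = 0.7
support = range(200)  # truncated support; the neglected tail mass is negligible here
total_mass = sum(density(y, h) for y in support)
first_moment = sum(y * density(y, h) for y in support)

print(total_mass, first_moment, inverse_link(h))
```

In this parametrization the inverse link is the exponential function, so the first moment and `inverse_link(h)` should both be close to exp(0.7).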



2.2 Mixture-of-experts model

The mixture-of-experts model with GLM1 experts is defined as:

$$f_{m,k}(x, y; \zeta) = \sum_{j=1}^{m} g_j(x; \nu)\,\pi(h_k(x; \theta_j), y) \cdot p_x
  = \sum_{j=1}^{m} g_j(x; \nu) \exp\{y\,a(h_k(x; \theta_j)) + b(h_k(x; \theta_j)) + c(y)\} \cdot p_x, \qquad (2)$$

where the functions $g_j \ge 0$ and $\sum_{j=1}^{m} g_j = 1$, with parameters $\nu \in V_m \subset \ell^2(\mathbb{R}^{v_m})$², $v_m$
denoting the dimension of $\nu$. The functions $h_k(x; \theta_j)$ are $k$th-degree polynomials on $\Omega$
with parameter vector $\theta_j \in \Theta_k \subset \ell^2(\mathbb{R}^{J_k})$, $J_k$ denoting the dimension of $\theta_j$; write the
vector of parameters of all experts as $\theta = (\theta_1', \dots, \theta_m')'$, defined on $\Theta_{mk} = \Theta_k^m$. The
parameter vector of the model is $\zeta = (\nu', \theta')'$ and is defined on $V_m \times \Theta_{mk}$, a subset of
$\mathbb{R}^{v_m + mJ_k}$. Throughout the paper we denote by $\mathcal{F}_{m,k}$ the class of (approximant) densities
$f_{m,k}$.
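
To make definition (2) concrete, here is a minimal sketch of the conditional part of the model (the known factor for the covariate density is omitted) with multinomial-logistic gates and Poisson experts whose log-mean is a polynomial in a single covariate. The gate form, the Poisson choice, the restriction to one covariate and every parameter value below are assumptions chosen for illustration.

```python
import math

def gates(x, nu):
    """Multinomial-logistic weights g_j(x; nu) for scalar x; nu holds (intercept, slope) pairs."""
    scores = [v0 + v1 * x for v0, v1 in nu]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(sc - top) for sc in scores]
    norm = sum(weights)
    return [w / norm for w in weights]

def poly(x, theta):
    """h_k(x; theta) = theta[0] + theta[1] x + ... + theta[k] x^k, by Horner's rule."""
    value = 0.0
    for coef in reversed(theta):
        value = value * x + coef
    return value

def expert(y, h):
    """Poisson expert written as exp{y a(h) + b(h) + c(y)} with a(h) = h, b(h) = -e^h."""
    return math.exp(y * h - math.exp(h) - math.lgamma(y + 1))

def me_density(y, x, nu, thetas):
    """Conditional density sum_j g_j(x; nu) pi(h_k(x; theta_j), y); the factor p_x is omitted."""
    return sum(g * expert(y, poly(x, th)) for g, th in zip(gates(x, nu), thetas))

# Hypothetical parameter values: m = 3 experts, polynomial degree k = 2, one covariate.
nu = [(0.0, 0.0), (0.5, -2.0), (-0.3, 1.5)]
thetas = [[0.1, 0.5, -0.2], [1.0, -0.3, 0.0], [-0.5, 0.2, 0.4]]

x = 0.4
g = gates(x, nu)
mass = sum(me_density(y, x, nu, thetas) for y in range(150))
print(g, mass)
```

Because the gates sum to one and each expert is a proper density in y, the mixture integrates to one in y for every x, which the last lines check on a truncated support.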

To derive consistency and convergence rates, one needs to impose some restrictions
on the functions $\pi$ and $g_j$ to avoid abnormal cases. This condition is not restrictive
and is satisfied by the multinomial logistic weight functions (g's) and the Bernoulli, 
binomial, Poisson and exponential experts, among many other classes of distributions 
and weight functions. 

Assumption 1. There exist functions $c_g(x) = (c_g^{(1)}(x), \dots, c_g^{(m)}(x))'$ and
$F(x, y) = (F^{(1)}(x, y), \dots, F^{(m)}(x, y))'$ with $E[c_g(X)'c_g(X)] < \infty$ and $E[F(X, Y)'F(X, Y)] < \infty$,
such that the vector-function $g(x; \nu) = (g_1(x; \nu), \dots, g_m(x; \nu))'$ satisfies

    sup [...] ≤ c_g(x);

and each expert $\pi(h_k(x; \theta_j), y)$ satisfies

    sup [...], for 1 ≤ j ≤ m.



² We denote $\ell^2(\mathbb{R}^k) = \{x \in \mathbb{R}^k : \sum_{j=1}^{k} x_j^2 < \infty\}$.

2.3 Maximum likelihood estimation and the EM algorithm

2.3.1 Maximum likelihood estimation

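We consider the maximum likelihood method of estimation. We want to find the
parameter vector $\hat\zeta_n = (\hat\nu_n', \hat\theta_n')'$ that maximizes

$$L_n(\zeta) = n^{-1} \sum_{i=1}^{n} \log\{f_{m,k}(X_i, Y_i; \zeta) / \varphi_0(X_i, Y_i)\}, \qquad (3)$$

where $\varphi_0(X, Y) = \exp(c(Y))\,p_x(X)$. That is,

$$\hat\zeta_n = \arg\max_{\zeta} L_n(\zeta). \qquad (4)$$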

The maximum likelihood estimator is not necessarily unique. In general, mixture-of- 
experts models are not identifiable under permutation of the experts. To circumvent this 
issue one must impose restrictions on the experts and the weighting (or the parameter 
vector of the model), as shown in Jiang and Tanner [1999b]. 

Define the Kullback-Leibler (KL) divergence between $p_{xy}$ and $f_{m,k}$ as

$$KL(p_{xy}, f_{m,k}) = \int_\Omega \int_A \log\frac{p_{xy}}{f_{m,k}} \, dP_{y|x} \, dP_x. \qquad (5)$$

The log-likelihood function in (3) converges to its expectation with probability one as
the number of observations increases. Therefore, in the limit, the maximizer $\hat f_{m,k}$ of
(3) (indexed by $\hat\zeta_n$) also minimizes the Kullback-Leibler divergence between the true
density and the estimated density.

In this work we only consider i.i.d. observations, but it is straightforward to extend the
results to more general data generating processes. The next assumption formalizes this.

Assumption 2 (Data Generating Process). The sequence $(X_i', Y_i)'$, $i = 1, 2, \dots$, is
an independent and identically distributed sequence of random vectors with common
distribution $P_{xy}$.

The next result ensures the existence of such an estimator.

Theorem 2.1 (Existence). For a sequence $\{(V_m \times \Theta_{mk})_n\}$ of compact subsets of
$V_m \times \Theta_{mk}$, $n = 1, 2, \dots$, there exists a $\mathcal{B}(\Omega \times A)$-measurable function
$\hat\zeta_n : \Omega \times A \to (V_m \times \Theta_{mk})_n$ satisfying equation (4) $P_{xy}$-almost surely.

We demonstrate that under classical assumptions, such as identifiability and a
unique maximizer, the maximum likelihood estimator $\hat\zeta_n$ consistently estimates the best
model in the class $\mathcal{F}_{m,k}$, indexed by $\zeta^*$; i.e. the maximum likelihood estimator $\hat\zeta_n$
converges almost surely to $\zeta^*$. It can be shown that the convergence results also hold for
the ergodic case if we assume that $(\log f_{m,k}(X_i, Y_i; \zeta))_{i=1}^{n}$ is ergodic. However, simpler
conditions to ensure ergodicity of the likelihood function are not trivial and hence are out
of the scope of this paper.

Assumption 3 (Identifiability). For any distinct $\zeta_1$ and $\zeta_2$ in $V_m \times \Theta_{mk}$, for almost every
$(x, y) \in \Omega \times A$,

$$f_{m,k}(x, y; \zeta_1) \ne f_{m,k}(x, y; \zeta_2).$$



Jiang and Tanner [1999b] find sufficient conditions for identifiability of the
parameter vector for the HME with one layer, while Mendes et al. [2006] do so for a binary tree
structure. Both cases can be adapted to more general specifications. Although one can
show consistency to a set, we adopt a more traditional approach requiring identifiability 
of the parameter vector. 

Assumption 4 (Unique Maximizer). Let $\zeta = (\nu', \theta')'$ and let $\zeta^*$ be the argument that
maximizes $E \log f_{m,k}$ over $V_m \times \Theta_{mk}$. Then



This assumption follows from a second-order Taylor expansion of the expected
likelihood around the parameter vector that minimizes (5), denoted $\zeta^*$. We require the
Hessian to be invertible at $\zeta^*$. The requirement for an identifiable unique maximizer is
only technical, in the sense that the objective function is not allowed to become too flat
around the maximum (for more discussion on this topic see Bates and White [1985],
p. 156, and White [1996], chapter 3). A similar assumption was made in the series of
papers by Carvalho and Tanner [2005a,b, 2006, 2007] and by Zeevi et al. [1998], and it is
a usual assumption in the estimation of misspecified models.

Theorem 2.2 (Parametric consistency of misspecified models). Under Assumptions 1,
2, 3, and 4, the maximum likelihood estimate $\hat\zeta_n \to \zeta^*$ as $n \to \infty$, $P_{xy}$-a.s.

Huerta et al. [2003] and the series of papers by Carvalho and Tanner [2005a,b, 2006, 
2007] derive similar results for time series processes. 

2.3.2 The EM algorithm

It is often easier to maximize the complete likelihood function of a (H)ME instead of
(3) (see Jordan and Jacobs [1994], Xu and Jordan [1996] and Yang and Ma [2011]).
Let $z_i' = (z_{i1}, \dots, z_{im})$ denote a binary vector with $z_{ij} = 1$ if the observation $(x_i, y_i)$
is generated by expert j (i.e. $\pi(h_k(\cdot; \theta_j), \cdot)$). We assume $z_i$ has a multinomial
distribution with parameters $\tau_i = (\tau_{i1}, \dots, \tau_{im})$. The complete log-likelihood function
is given by

$$l_n(\kappa) = \sum_{i=1}^{n} \sum_{j=1}^{m} z_{ij} \left( \log g_j(x_i; \nu) + \log \pi(h_k(x_i; \theta_j), y_i) - \log \varphi_0(x_i, y_i) \right), \qquad (7)$$

where $\kappa = (\nu', \theta', \tau')'$.

We can estimate this model using the expectation-maximization (EM) algorithm
put forward by Dempster et al. [1977]. Let $\kappa^{(l)} = (\nu^{(l)}, \theta^{(l)}, \tau^{(l)})$ denote the parameter
estimates at the $l$th iteration and define $q(\kappa; \kappa^{(l)}) = E(l_n \mid x, y; \kappa^{(l)})$. In the E-step, we
obtain $q(\kappa; \kappa^{(l)})$ by replacing $z_{ij}$ with its expectation

$$\tau_{ij}^{(l)} = \frac{g_j(x_i; \nu^{(l)})\,\pi(h_k(x_i; \theta_j^{(l)}), y_i)}{\sum_{r=1}^{m} g_r(x_i; \nu^{(l)})\,\pi(h_k(x_i; \theta_r^{(l)}), y_i)}.$$

In the M-step we maximize $q(\kappa; \kappa^{(l)})$ with respect to $\nu$ and $\theta$. The problem simplifies
to finding the parameters $\nu^{(l+1)}$ that maximize

$$q(\nu; \kappa^{(l)}) = \sum_{i=1}^{n} \sum_{j=1}^{m} \tau_{ij}^{(l)} \log g_j(x_i; \nu), \qquad (8)$$

and to find the parameters $\theta^{(l+1)}$ we have to maximize

$$q(\theta; \kappa^{(l)}) = \sum_{i=1}^{n} \sum_{j=1}^{m} \tau_{ij}^{(l)} \left[ y_i\,a(h_k(x_i; \theta_j)) + b(h_k(x_i; \theta_j)) \right]. \qquad (9)$$
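
The E-step and the two M-step objectives (8) and (9) can be sketched as follows. This is an illustration under assumptions rather than the estimation code used in the paper: gates are multinomial logistic, experts are Poisson with a(h) = h and b(h) = -exp(h), there is a single covariate, and the data and starting values are made up. The maximization of the two objectives is left to whatever numerical optimizer the reader prefers.

```python
import math

def gates(x, nu):
    scores = [v0 + v1 * x for v0, v1 in nu]
    top = max(scores)
    weights = [math.exp(sc - top) for sc in scores]
    norm = sum(weights)
    return [w / norm for w in weights]

def poly(x, theta):
    value = 0.0
    for coef in reversed(theta):
        value = value * x + coef
    return value

def log_expert(y, h):
    # log of exp{y a(h) + b(h) + c(y)} for the assumed Poisson parametrization
    return y * h - math.exp(h) - math.lgamma(y + 1)

def e_step(xs, ys, nu, thetas):
    """Responsibilities tau_ij, proportional to g_j(x_i; nu) * pi(h_k(x_i; theta_j), y_i)."""
    tau = []
    for x, y in zip(xs, ys):
        joint = [g * math.exp(log_expert(y, poly(x, th)))
                 for g, th in zip(gates(x, nu), thetas)]
        norm = sum(joint)
        tau.append([v / norm for v in joint])
    return tau

def q_gate(tau, xs, nu):
    """Objective (8): sum_i sum_j tau_ij log g_j(x_i; nu)."""
    return sum(t * math.log(g)
               for x, row in zip(xs, tau)
               for t, g in zip(row, gates(x, nu)))

def q_expert(tau, xs, ys, thetas):
    """Objective (9): sum_i sum_j tau_ij [y_i a(h_ij) + b(h_ij)] with a(h) = h, b(h) = -e^h."""
    total = 0.0
    for x, y, row in zip(xs, ys, tau):
        for t, th in zip(row, thetas):
            h = poly(x, th)
            total += t * (y * h - math.exp(h))
    return total

# Made-up data and starting values: m = 2 experts, degree k = 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [0, 1, 1, 3, 6]
nu = [(0.0, 1.0), (0.0, -1.0)]
thetas = [[0.5, 1.0], [0.0, -0.5]]

tau = e_step(xs, ys, nu, thetas)
print(tau[-1], q_gate(tau, xs, nu), q_expert(tau, xs, ys, thetas))
```

One EM iteration would call `e_step` once and then update the gate and expert parameters by maximizing `q_gate` and `q_expert` with the responsibilities held fixed.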

3 Main results

In this section we present the main results of the paper. Write the KL divergence as
follows:

$$KL(p_{xy}, \hat f_{m,k}) = KL(p_{xy}, f^*_{m,k}) + E\left[\log\frac{f^*_{m,k}}{\hat f_{m,k}}\right], \qquad (10)$$

where $f^*_{m,k}$ is the minimizer of $KL(p_{xy}, f_{m,k})$ on $\mathcal{F}_{m,k}$. The first term
on the right-hand side is the approximation error and the second term is the estimation
error. The approximation error measures "how well" an element of $\mathcal{F}_{m,k}$ approximates
$p_{xy}$, and approaches zero as m increases. The estimation error measures "how far"
the estimated model is from the best approximant in the class. Our goal is to find
bounds for both approximation and estimation errors and combine these results to find
the convergence rate of the maximum likelihood estimator.



3.1 Approximation rate 

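We follow Jiang and Tanner [1999a] to bound the approximation error. Define the upper
divergence between $p \in \Pi(\mathcal{W}^{\alpha}_{\infty,K_0})$ and $f_{m,k} \in \mathcal{F}_{m,k}$ as

$$\mathcal{D}(p, f_{m,k}) = \int_\Omega \sum_{j=1}^{m} g_j(x; \nu)\,(h_k(x; \theta_j) - h(x))^2 \, dP_x. \qquad (11)$$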

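We can use the upper divergence to bound the KL divergence.

Lemma 3.1. Let $p \in \Pi(\mathcal{W}^{\alpha}_{\infty,K_0})$ and $f_{m,k} \in \mathcal{F}_{m,k}$. Then

$$KL(p, f_{m,k}) \le M_\infty \mathcal{D}(p, f_{m,k}),$$

where $M_\infty \ge (1/2) \operatorname{ess\,sup}_{x \in \Omega} \left[ |\varphi(h(x))| \cdot |\ddot{a}(h(x))| + |\ddot{b}(h(x))| \right]$.

This lemma will be used to bound uniformly the approximation rate of the family
of functions $\mathcal{F}_{m,k}$.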

Before presenting the main conditions, we shall introduce some key concepts. 

Definition 3.1 (Fine partition). For $m = 1, 2, \dots$, let $Q^m = \{Q_j^m\}_{j=1}^{r_m}$ be a partition
of $\Omega$. If $r_m \to \infty$ and if for all $x_1, x_2 \in Q_j^m$, $\max_{1 \le i \le s} |(x_1 - x_2)_i| \le c_0 / r_m^{1/s}$, for
some constant $c_0$ independent of $x_1$, $x_2$, m or j, then $\{Q^m, m = 1, 2, \dots\}$ is called a
sequence of fine partitions with cardinality $r_m$ and bounding constant $c_0$.

Here we abuse notation slightly by using m as an index of the collection of
partitions of $\Omega$. However, this abuse of notation is justified because m is an increasing
sequence and the collection of partitions depends on an increasing function of m. The
next definition will be useful later to bound the "growth rate" of the model and to
deal with hierarchical mixtures of experts (see Jiang and Tanner [1999a]).

Definition 3.2 (Subgeometric). A sequence of numbers $\{a_j\}$ is called sub-geometric with
rate bounded by $M_1$ if $a_j \in \mathbb{N}$, $a_j \to \infty$ as $j \to \infty$, and $1 < |a_{j+1}/a_j| < M_1$ for all
$j = 1, 2, \dots$ and for some finite constant $M_1$.
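
A simple example may help fix ideas for Definitions 3.1 and 3.2. The sketch below assumes the covariate space is the unit cube and splits it into a regular grid with q cells per axis; it then checks empirically that two points in the same cell differ by at most the inverse of the s-th root of the number of cells in every coordinate (so the bounding constant is one), and that the sequence of cell counts is sub-geometric. The unit cube, the regular grid and all names are assumptions made for illustration.

```python
import random

def cell_index(x, q):
    """Index of the grid cell of [0, 1)^s (q cells per axis) that contains x."""
    return tuple(min(int(xi * q), q - 1) for xi in x)

def max_coordinate_gap(x1, x2):
    return max(abs(u - v) for u, v in zip(x1, x2))

s, q = 3, 4
r_m = q ** s  # cardinality of the partition

random.seed(0)
points = [[random.random() for _ in range(s)] for _ in range(2000)]

cells = {}
for p in points:
    cells.setdefault(cell_index(p, q), []).append(p)

# Largest coordinate-wise distance between two sampled points sharing a cell.
worst = max(max_coordinate_gap(p1, p2)
            for members in cells.values()
            for p1 in members
            for p2 in members)
bound = 1.0 / r_m ** (1.0 / s)  # c_0 / r_m^(1/s) with c_0 = 1

# Cell counts q^s, q = 1, 2, ...: successive ratios lie in (1, 2^s].
ratios = [(j + 1) ** s / j ** s for j in range(1, 50)]

print(worst, bound, max(ratios))
```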

The key idea behind finding the approximation rates is to control the approximation
rate inside each element of a fine partition of the space. More precisely, we bound the
approximation error inside the "worst" (most difficult to approximate) element of the
partition. We need the following assumption.

Assumption 5. There exists a fine partition $Q^m$ of $\Omega$, with bounding constant $c_0$ and
cardinality sequence $r_m$, $m = 1, 2, \dots$, such that $\{r_m\}$ is sub-geometric with rate
bounded by some constant $M_1$, and there exist a constant $c_1 > 0$ and a parameter
vector $\nu_{c_1} \in V_m$ such that

                                                                        (12)

This assumption is similar to, but weaker than, the one employed in Jiang and Tanner
[1999a] and requires that the vector $g = (g_1, \dots, g_{r_m})$ approximates the vector of
characteristic functions $(I_{Q_1^m}, \dots, I_{Q_{r_m}^m})$ at a rate not slower than $O(r_m)$.

The notation $r_m$ is introduced to deal with the hierarchical mixture of experts
structure. To allow more flexibility, define $r_m$ as the maximum number of experts the
structure can hold, e.g. a binary tree with $l = 1, 2, \dots$ layers has at most $2^l$ experts, and if
we increase the number of layers by one, the actual number of experts is somewhere
between $2^l$ and $2^{l+1} - 1$ (here we are assuming the tree is balanced without loss of
generality). If we denote this class of models by $\mathcal{F}^*_{r_m,k}$, then $\mathcal{F}^*_{r_m,k} \subseteq \mathcal{F}_{m,k} \subset \mathcal{F}^*_{r_{m+1},k}$. The
sub-geometric assumption ensures that $r_m \le m < r_{m+1}$, where m is the actual number
of experts in the model.
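
Theorem 3.1 (Approximation rate). Let $p \in \Pi(\mathcal{W}^{\alpha}_{\infty,K_0})$ and $f_{m,k} \in \mathcal{F}_{m,k}$. If Assumption
5 holds, then

$$\sup_{p \in \Pi(\mathcal{W}^{\alpha}_{\infty,K_0})} \; \inf_{f_{m,k} \in \mathcal{F}_{m,k}} KL(p, f_{m,k}) \le \frac{c}{m^{2[\alpha \wedge (k+1)]/s}}, \qquad (13)$$

for some constant c not depending on m or k.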

This result is a generalization of Jiang and Tanner [1999a] in two directions. First,
we allow the target function to be in a Sobolev class with $\alpha$ derivatives; second, we
consider a polynomial approximation to the target function in each expert (in fact,
their result is the special case $\alpha = 2$ and $k = 1$). This generalization enables us
to address the important problem of whether it is better to mix many simple experts or to
mix a few complex experts. The result also holds under more general specifications of
densities/experts. In the case where we also have a dispersion parameter to estimate, we just
have to modify Lemma 3.1 accordingly and the same result holds.

This rate also agrees with the optimal approximation rate of functions in $\mathcal{W}^{\alpha}_{\infty,K_0}$ by
piecewise polynomials [Windlund, 1977]. One can see that, under Assumption 5, this is
exactly what we are doing. Therefore this approximation rate is sharp.

3.2 Convergence rate 

In this section we deduce the convergence rate for the mixture-of-experts model.
Equation (10) gives us an expansion of the KL divergence in terms of the approximation and
estimation errors. In the previous section we found a bound for the approximation
error; in this section we bound the estimation error and combine it with the approximation
error to find the rate of convergence.

The estimation error measures "how far" the estimated function is from the best
approximant in the class. We will demonstrate that the estimation error in (10) is
$O_p((mJ_k + v_m)(\log n / n))$. We also show that by combining this result with the approximation
rate it is possible to achieve a convergence rate of $O_p((\log n / n)^{2r/(2r+s)})$, with
$r = \alpha \wedge (k+1)$, which is close to the optimal nonparametric rate if $r = \alpha$. Moreover,
if there is a unique identifiable maximizer to the likelihood problem (Assumptions 3
and 4), we are able to remove the $\log n$ term and achieve a better convergence rate,
possibly optimal if $r = \alpha$.

The next theorem summarizes the convergence rate of the maximum likelihood
estimator $\hat f_{m,k}$ with respect to the KL divergence between the true density $p_{xy}$ and the
estimated density.

Theorem 3.2 (Convergence Rate). Let $p_{xy} \in \Pi(\mathcal{W}^{\alpha}_{\infty,K_0})$ and let $\hat f_{m,k}$ denote its maximum
likelihood estimator on $\mathcal{F}_{m,k}$. Let m be allowed to increase such that $m \to \infty$ and
$m(\log n / n) \to 0$ as n and m increase. Under Assumptions 1, 2 and 5,

$$KL(p_{xy}, \hat f_{m,k}) = O_p\left(\frac{1}{m^{2r/s}} + (mJ_k + v_m)\frac{\log n}{n}\right), \qquad (14)$$

where $r = \alpha \wedge (k+1)$. In particular, if we assume $v_m = O(m)$, and let m be
proportional to $(n / \log n)^{s/(2r+s)}$, then

$$KL(p_{xy}, \hat f_{m,k}) = O_p\left(\left(\frac{\log n}{n}\right)^{2r/(2r+s)}\right). \qquad (15)$$

Although the previous result is derived for the i.i.d. case, the result also holds
for more general data generating processes. In this result we use (through van der Geer
[2000]) a uniform probability inequality for i.i.d. processes to derive Theorem A.1,
but the same result can be obtained by using uniform inequalities for more general
processes. This convergence rate is close to the optimal rate found in the sieves literature
if $r = \alpha$; see for instance Stone [1980] and Barron and Sheu [1991].

To derive this rate we do not assume that there is a unique identifiable maximizer
$f^*_{m,k}$; in fact, we assume $f^*_{m,k}$ is any such maximizer. The price to pay for such
generality is the inclusion of the $\log n$ term in the convergence rates. If we assume
$f^*_{m,k}$ is unique and uniquely identified by a parameter vector $\zeta^*$, we can exploit the
localization property of Theorem A.1. More precisely, we can exploit the fact that we
are only interested in the behavior of the empirical process in a neighborhood of
$\zeta^*$. Under such conditions and assuming $r = \alpha$, we are able to achieve the optimal
convergence rate in the sieves literature [Barron and Sheu, 1991, Stone, 1980].

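Theorem 3.3 (Optimal Convergence Rate). Let $p_{xy} \in \Pi(\mathcal{W}^{\alpha}_{\infty,K_0})$ and let $\hat f_{m,k}$ denote its
maximum likelihood estimator on $\mathcal{F}_{m,k}$. Let m be allowed to increase such that $m \to \infty$
and $m/n \to 0$ as n and m increase. Under Assumptions 1-5,

$$KL(p_{xy}, \hat f_{m,k}) = O_p\left(\frac{1}{m^{2r/s}} + \frac{mJ_k + v_m}{n}\right), \qquad (16)$$

where $r = \alpha \wedge (k+1)$. In particular, if we assume $v_m = O(m)$, and let m be
proportional to $n^{s/(2r+s)}$, then

$$KL(p_{xy}, \hat f_{m,k}) = O_p\left(n^{-2r/(2r+s)}\right). \qquad (17)$$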

The same result follows for more general data generating processes and the same
considerations made after Theorem 3.2 hold.

By imposing that there exists a unique maximum we are able to remove the $\log n$ term
and recover the optimal convergence rate for sieve estimates found in the literature.
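
The way the number of experts balances the two terms of (14) and (16) can be checked numerically. The sketch below sets every constant hidden in the bounds to one (including the number of parameters per expert), which is an assumption made purely for illustration; the function names are likewise hypothetical. With the balancing choice of m, the two terms are equal, so the bound is exactly twice the stated rate.

```python
import math

def kl_bound(m, n, r, s, log_term):
    """m^(-2r/s) + m * penalty / n, with penalty = log n for (14) and 1 for (16).
    All constants (including J_k and v_m / m) are set to one: an assumption."""
    penalty = math.log(n) if log_term else 1.0
    return m ** (-2.0 * r / s) + m * penalty / n

def balanced_m(n, r, s, log_term):
    """Number of experts that equates the approximation and estimation terms."""
    effective_n = n / math.log(n) if log_term else n
    return effective_n ** (s / (2.0 * r + s))

def rate(n, r, s, log_term):
    """(log n / n)^(2r/(2r+s)) as in (15), or n^(-2r/(2r+s)) when the log term is dropped."""
    effective_n = n / math.log(n) if log_term else n
    return effective_n ** (-2.0 * r / (2.0 * r + s))

n, s, alpha, k = 10_000, 5, 6, 3
r = min(alpha, k + 1)
for log_term in (True, False):
    m = balanced_m(n, r, s, log_term)
    print(log_term, round(m, 2), kl_bound(m, n, r, s, log_term), rate(n, r, s, log_term))
```

Treating m as a real number is a simplification; in practice it would be rounded to an integer, which changes the bound only by a constant factor.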

3.3 Consistency 

Now we apply the previous results to show that the maximum likelihood estimator is
consistent, i.e. the KL divergence between the true density and the estimated model
approaches zero as the sample size n and the index of the approximation class m go to
infinity.

Corollary 3.1 (Consistency). Let $p_{xy} \in \Pi(\mathcal{W}^{\alpha}_{\infty,K_0})$ and let $\hat f_{m,k}$ denote its maximum
likelihood estimator on $\mathcal{F}_{m,k}$. Allow $m \to \infty$ and $m(\log n / n) \to 0$ as n and m increase.
Under Assumptions 1, 2 and 5, $KL(p_{xy}, \hat f_{m,k}) \to 0$ as n and m increase.



4 Effects of m and k

We consider a framework similar to Jiang and Tanner [1999a], but one is allowed to mix
m GLM1 experts whose terms are polynomials in the variables, as opposed to $k = 1$.
We also assume that the true mean function is $\varphi(h)$ with $h \in \mathcal{W}^{\alpha}_{\infty,K_0}$, a Sobolev class
with $\alpha$ derivatives, as opposed to $\alpha = 2$.

By deriving a convergence rate such as (16) in this framework, we are able to gain
insight into the two important problems in the area of ME: (i) What number of experts
m should be chosen, given the size n of the training data? (ii) Given the total number
of parameters, is it better to use a few very complex experts, or is it better to combine
many simple experts?

For question (i), the results in Theorem 3.3 and Corollary 3.1 suggest that good
results can be obtained by choosing the number of experts m to grow as a power of n,
with an exponent in (0, 1) that may depend on the dimension of the input space and the
underlying smoothness of the target function. Smoother target functions and lower
dimensions generally encourage us to use fewer experts.

Question (ii) requires a more detailed study. The complexity of the experts
(submodels) is related to k, the order of the polynomials. We see that increasing k does
improve the approximation rate; however, this improvement is bounded by the number
of derivatives $\alpha$ of the function h. Moreover, this approximation rate is known to be
sharp for piecewise polynomials [Windlund, 1977]. The price to pay for this increase
in the approximation rate is a larger number of parameters in the model, i.e. a worse
estimation error. We will provide below a theoretical result on the optimal choice of k,
as well as some numerical evidence.

First of all, a simpler expression for the upper bound of the KL divergence in (16) can
be derived as $KL \le O_p(U)$, where $U = m^{-2(\xi \wedge \alpha)/s} + (m\xi^s)/n$ and $\xi = k + 1$. [This
assumes that $v_m = O(m)$ and uses the fact that the number of parameters needed in
s-dimensional polynomials of order k is bounded by $J_k \le (k+1)^s$.]

We now study the upper bound U, fixing the product $m\xi^s = C$, where C may
depend on n and is a bound for the rough order of the total number of parameters.

Proposition 4.1. Let $\xi = k + 1$ and $U = m^{-2(\xi \wedge \alpha)/s} + (m\xi^s)/n$ (which is an
upper bound for the KL convergence rate derived in Theorem 3.3). Then the following
statements are true:

I. Fixing the product $m\xi^s = C$, U is minimized at $\xi = \alpha \wedge (C^{1/s}/e) \equiv \xi_0$. The
corresponding optimal $m = \max(e^s, C/\alpha^s)$.

II. If $\alpha$ is finite, then U achieves the optimal rate $n^{-2\alpha/(s+2\alpha)}$ under the
following choices: $\xi$ is any constant that is at least $\alpha$ and does not vary with n, and
$m \sim c_1 n^{s/(s+2\alpha)}$ for some constant $c_1 > 0$.

III. If $\alpha = \infty$, the following choices will make U have a near-parametric rate
$U = O((\ln n)^s / n)$: $m \ge 2$ and is constant in n, and $\xi \sim c_2 \ln n$ for any constant $c_2 > s$.

Remark 1. This Proposition suggests that, for achieving optimal performance, $\xi$ (or
k, related to the complexity of the experts) and m (the number of experts) should
be traded off against each other. Fixing an upper bound C on the total number of parameters, the
optimal $\xi = \alpha \wedge (C^{1/s}/e) \equiv \xi_0$. The optimal compromise therefore depends both on $\alpha$
(smoothness of the target function) and s (the dimension of the input space). The
formula implies that (a) a smoother target function (indexed by a larger $\alpha$) will favor more
complex submodels (with larger $\xi$ or k), and (b) for a very smooth target function (with
large enough $\alpha$), a higher dimension s of the input space will favor the use of simpler
submodels (with smaller $\xi = k + 1$, possibly smaller than $\alpha$) and the use of more experts
(bigger m).

Remark 2. Although Result I shows how to construct an exactly optimal compromise
between $\xi$ and m, Results II and III show that good convergence rates are quite robust
against deviations from these optimal solutions. We note that near-optimal convergence
rates can always be achieved with $\xi$ not being too large compared to the sample size n.
This is summarized in the two situations in Results II and III, where we see that even
in the case $\alpha = \infty$, we only need about $\xi \sim \ln n$ to achieve a near-parametric
convergence rate.

One drawback of the above theoretical analysis is that it uses a rough upper bound
(which has a simple expression) for the total number of parameters associated with $k$th-order
$s$-dimensional polynomials. Below we conduct a numerical study in which the
exact number of parameters is used. When considering the choice of $k$, a first impulse
is to use polynomials of order $\alpha - 1$, but the number of variables in the model increases
exponentially with $k$ if $s > 1$. In fact, in many cases it is preferable to use a smaller
$k$ and many experts $m$ if one wishes to control the size of the estimation error. This is
consistent with Remark 1 above.

Table 1 compares the approximation error for distinct values of $k$ and $m$ while holding
the estimation error fixed. Assume we have $s = 5$ variables and $\alpha = 6$. Suppose a modeler
builds a model with $m = 5$ experts and, since it is known that $\alpha = 6$, also chooses
$k = 5$. If we further assume $v_m = m \times s$, the total number of parameters in the model
is $mJ_k + v_m = 1285$. We can see that the smallest approximation error is achieved at $k = 3$
and $m = 21$.

Similarly, fixing the approximation error, we see that a balance between $m$ and $k$ is
necessary. Fix $\alpha = 6$ and $s = 5$ and assume one wants a model with approximation
error proportional to 0.01. Table 2 shows that the model with the smallest estimation error
that achieves this approximation error is the one with $k = 3$ and $m = 18$.

Table 1: This table compares the approximation error of the model holding the estimation
error fixed. We assume $\alpha = 6$ and $s = 5$ and allow for distinct specifications of $m$
and $k$.

  $k$     $m$     $m^{-2(k+1)/s}$     $mJ_k + v_m$
   0      214        0.1169              1,284
   1      117        0.0221              1,287
   2       49        0.0094              1,274
   3       21        0.0077              1,271
   4       10        0.0100              1,310
   5        5        0.0210              1,285



Table 2: This table compares the number of parameters of the model holding the
approximation error fixed. We assume $\alpha = 6$, $s = 5$ and the approximation error to be
proportional to 0.01. We allow for distinct specifications of $m$ and $k$.

  $k$       $m$       $m^{-2(k+1)/s}$     $mJ_k + v_m$
   0      100,000        0.0100            600,000
   1          316        0.0100              3,476
   2           46        0.0101              1,196
   3           18        0.0098              1,098
   4           10        0.0100              1,310
   5            7        0.0099              1,799
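The entries of Tables 1 and 2 can be recomputed with a few lines of Python. This is a minimal sketch under one assumption that is ours rather than stated in the text: $J_k$ is taken to be the number of monomials of total degree at most $k$ in $s$ variables, $\binom{s+k}{k}$. That choice reproduces the count $mJ_k + v_m = 1285$ quoted above for $m = 5$, $k = 5$, $s = 5$. The function names are hypothetical.

```python
from math import comb


def poly_param_count(k: int, s: int) -> int:
    """Assumed form of J_k: monomials of total degree <= k in s variables."""
    return comb(s + k, k)


def total_params(m: int, k: int, s: int) -> int:
    """m * J_k + v_m, with the text's choice v_m = m * s."""
    return m * poly_param_count(k, s) + m * s


def approx_error(m: int, k: int, s: int) -> float:
    """Approximation-error column of the tables, m^(-2(k+1)/s)."""
    return m ** (-2.0 * (k + 1) / s)
```

For example, `total_params(18, 3, 5)` gives the 1,098 parameters reported in Table 2 for $k = 3$, $m = 18$, and `approx_error(18, 3, 5)` is approximately 0.0098.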



This quick exercise illustrates one of the main conclusions of this paper: it is not
true that one should always use a few complex models (small $m$ and large $k$) or always
choose many simple ones (small $k$ and large $m$); a balance between $k$ and $m$
should be used instead. Moreover, a small increase in $k$ relative to the linear model
($k = 1$) can yield a substantial improvement in the approximation and estimation errors.

The results in this paper focus only on the target density and mixture-of-experts model
specified in Sections 2.1 and 2.2, respectively. However, similar results can be derived for
more complex models and target densities.

5 Conclusion



In this paper we study the mixture-of-experts model with experts in a one-parameter exponential
family with mean $\psi(h_k)$, where $h_k$ is a $k$th-order polynomial and $\psi(\cdot)$ is the inverse link
function. We derive sharp approximation rates with respect to the Kullback-Leibler
divergence and the convergence rate of the maximum likelihood estimator to densities in a
one-parameter exponential family with mean $\psi(h)$ with $h \in \mathcal{W}^{\infty}_{\alpha,K_0}$, a Sobolev class
with $\alpha$ derivatives. We find that the convergence rate of the maximum likelihood estimator
to the true density is $O_p(m^{-2(\alpha \wedge (k+1))/s} + (mJ_k + v_m)n^{-1}\log n)$, where $n$ is the
number of observations, $s$ is the number of covariates, $J_k$ is the number of parameters of
the polynomial $h_k$, $m$ is the number of experts and $v_m$ is the number of parameters of the
weight functions. Further, if the maximum likelihood estimator is uniquely identified,
we can remove the $\log n$ term from the convergence rate.

We discuss model specification and the effects on approximation and estimation 
errors and conclude that the best error bound is achieved using a balance between k 
and m, and inclusion of polynomial terms might render better error bounds. Also, the 
results of this paper can be generalized to more complex target densities and models 
with simple modifications to the proofs. 

We generalize Jiang and Tanner [1999a] in several directions: (i) we assume one
can include polynomial terms of the variables in the GLM1 experts; (ii) we assume the
target density is in a $\mathcal{W}^{\infty}_{\alpha,K_0}$ class, for $\alpha > 0$, instead of $\mathcal{W}^{\infty}_{2,K_0}$; (iii) we show
consistency of the quasi-maximum likelihood estimator for a fixed number of experts; (iv) we
calculate non-parametric convergence rates of the maximum likelihood estimator; (v)
we show non-parametric consistency when the number of experts and the sample size
increase; and finally (vi) we show that, by using polynomials in the experts, one can get
better error bounds. These developments shed light on the important questions
of how many experts should be chosen and how complex the experts themselves should
be.

A Showing the convergence rate

In this appendix we explain and justify the main steps in proving the convergence rate. 

One of the drawbacks of working with the Kullback-Leibler divergence is that it is
not bounded. An alternative is to use the Hellinger distance.

Definition A.1 (Hellinger Distance). Let $P$ and $Q$ denote two probability measures
absolutely continuous with respect to some measure $\lambda$. The Hellinger distance between
$P$ and $Q$ is given by



Alternatively, the Hellinger distance between two densities $p$ and $q$ with respect to $\lambda$ is
given by



One can show that if the likelihood ratio is bounded, the KL divergence is bounded
by a constant times the square of the Hellinger distance. We use the following result due
to Yang and Barron [2002], which is presented together with a basic inequality relating
the Hellinger distance and the Kullback-Leibler divergence.

Lemma A.1 (Yang and Barron [2002]). Let $p_{xy} = dP$ and $\|p_{xy}/f\|_{\infty,\Omega\times A} \le c_s^2$, for
$f \in \mathcal{F}_{m,k}$. Then




(18) 




(19) 



$d_h^2(p_{xy}, f) \le KL(p_{xy}, f) \le 2(1 + \log c_s)\, d_h^2(p_{xy}, f)$,

where $d_h(p, f)$ stands for the Hellinger distance between the densities $p$ and $f$ with
respect to $\lambda$.

This lemma implies that the Kullback-Leibler divergence is bounded above and below by
constant multiples of the squared Hellinger distance, and therefore the convergence rate in
squared Hellinger distance is the same as the convergence rate in Kullback-Leibler divergence.
The only problem is that, in general, the boundedness condition does not hold on the whole
set $A$ (the support of $Y$). One could overcome this complication by finding the convergence
rate inside some subset of $A$ where the boundedness condition holds and controlling the
tail probability outside this subset.
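The two-sided bound of Lemma A.1 can be checked numerically on small discrete distributions. The sketch below is illustrative only and rests on two assumptions of ours: the squared Hellinger distance is taken as $\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$ (without a factor $1/2$; the paper's own normalization is not assumed to be this one), and $c_s^2$ is taken as the largest likelihood ratio $\max_i p_i/q_i$. The function names are hypothetical.

```python
import math


def hellinger_sq(p, q):
    """Squared Hellinger distance, sum (sqrt(p_i) - sqrt(q_i))^2 (assumed normalization)."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p, q) = sum p_i log(p_i / q_i), for strictly positive q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def sandwich(p, q):
    """Return (d_h^2, KL, 2(1 + log c_s) d_h^2) with c_s^2 = max_i p_i / q_i."""
    cs_sq = max(a / b for a, b in zip(p, q))
    d2 = hellinger_sq(p, q)
    upper = 2.0 * (1.0 + math.log(math.sqrt(cs_sq))) * d2
    return d2, kl_divergence(p, q), upper
```

For $p = (0.5, 0.5)$ and $q = (0.9, 0.1)$ the three returned values are increasing, as the lemma requires; the upper constant grows only logarithmically in the likelihood-ratio bound.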

Let $S(Y, X)$ denote a scalar function of $(Y_1, X_1'), \ldots, (Y_n, X_n')$ and $B(\beta) = \{y \in
A : |y| \le \beta\}$. For every $K \in \mathbb{R}$,

$P(S(Y, X) > K) \le P(\{S(Y, X) > K\} \cap B(\beta)) + P(|Y| > \beta)$. (20)

If $A$ is bounded, we can choose $\beta = \operatorname{ess\,sup} |A|$, and the second term on the right-hand
side will be zero. Otherwise, we can take $\beta$ large enough that $P(|Y| > \beta)$ is
small or converges to zero at some rate.
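The decomposition in (20) holds event by event, so it can be illustrated with empirical frequencies. The following sketch is a simplified illustration under our own assumptions: a single standard normal $Y$, the statistic $S(y) = y^2$, and arbitrary values of $K$ and $\beta$; none of these choices come from the paper.

```python
import random


def tail_split(samples, statistic, K, beta):
    """Empirical versions of the three probabilities appearing in (20)."""
    n = len(samples)
    lhs = sum(1 for y in samples if statistic(y) > K) / n
    inside = sum(1 for y in samples if statistic(y) > K and abs(y) <= beta) / n
    tail = sum(1 for y in samples if abs(y) > beta) / n
    return lhs, inside, tail


rng = random.Random(0)  # fixed seed so the run is reproducible
ys = [rng.gauss(0.0, 1.0) for _ in range(10000)]
lhs, inside, tail = tail_split(ys, lambda y: y * y, K=1.5, beta=2.0)
```

Increasing `beta` shrinks the tail term at the cost of a larger region on which the likelihood ratio must be controlled, which is the trade-off exploited in the proofs.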

In order to bound the estimation error we shall use results from the theory of empir- 
ical processes. The convergence rate theorem presented below is derived for the i.i.d. 
case, however the same result holds for martingales (see van der Geer [2000]). 

To control the estimation rate inside a class of functions we have to measure how
big the class is. Let $N_B(\varepsilon, \mathcal{F}, \|\cdot\|)$ denote the number of $\varepsilon$-brackets$^1$ with respect to
the distance $\|\cdot\|$ needed to cover the set $\mathcal{F}$, and $H_B(\varepsilon, \mathcal{F}, \|\cdot\|) = \log N_B(\varepsilon, \mathcal{F}, \|\cdot\|)$
the respective bracketing entropy. Moreover, let const. denote some finite universal
constant that may change each time it appears, and write $\mathcal{F}^{1/2}_{m,k} = \{\sqrt{f} : f \in \mathcal{F}_{m,k}\}$.
We show that, under some conditions,

$H_B(\varepsilon, \mathcal{F}^{1/2}_{m,k}, \|\cdot\|_{2,\Omega\times A}) \le \text{const.}\,(mJ_k + v_m) \log \frac{C}{\varepsilon}$,

for some finite constant $C$ not depending on $\varepsilon$.

$^1$For a formal definition of bracketing numbers see van der Vaart and Wellner [1996], chapters 2.1
and 2.7.

Hence, our first task is to find the bracketing entropy of $\mathcal{F}^{1/2}_{m,k}$. Assumption 1 implies
that

where $0 \le \delta_j = g_j \pi(h_k(x; \theta_j), y)/f_{m,k} \le 1$ and $\sum_j \delta_j = 1$.

Hence, for any $f_1$ and $f_2$ in $\mathcal{F}_{m,k}$ indexed respectively by the parameter vectors
$\zeta_1$ and $\zeta_2$ in $V_m \times \Theta_{m,k}$,

$|\sqrt{f_1} - \sqrt{f_2}| \le c(x, y)\,|\zeta_1 - \zeta_2|_2$, (21)

with $c(x, y)^2 = (\sqrt{m}/2)[|F(x, y)'F(x, y)| + |c_g(x)'c_g(x)|]$. Therefore, the square-root
densities in $\mathcal{F}_{m,k}$ are Lipschitz in the parameters, with Lipschitz function $c(x, y) \in
L^2(\Omega \times A)$.

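Lemma A.2 (Bracketing Entropy). Under assumption 1, for any $0 < \delta \le 1$,

$H_B(\varepsilon, \mathcal{F}^{1/2}_{m,k}, \|\cdot\|_2) \le \text{const.}\,(mJ_k + v_m) \log \frac{C}{\varepsilon}$, (22)

where $C = 2\|c(X, Y)\|_{2,\Omega\times A}\,\mathrm{diam}(V_m \times \Theta_{m,k})$; and

$\int_0^{\delta} H_B^{1/2}(u, \mathcal{F}^{1/2}_{m,k}, \|\cdot\|_2)\,du \le \text{const.}\,(mJ_k + v_m)^{1/2}\,\delta \log^{1/2} \frac{C}{\delta}$. (23)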



Proof. The first part of the lemma follows from Lemma C.5 and equation (21) together
with assumption 1.

The second inequality follows from Lemma C.2. If we take $b = \delta$ and $C > e^{\pi}$, we
have

$\int_0^{\delta} H_B^{1/2}(u, \mathcal{F}^{1/2}_{m,k}, \|\cdot\|_2)\,du \le \text{const.}\,(mJ_k + v_m)^{1/2}\,\delta \log^{1/2} \frac{C}{\delta}$,

proving the lemma. □
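The entropy-integral step rests on a bound of the form $\int_0^{b} \log^{1/2}(C/u)\,du \le b(\sqrt{\pi} + \log^{1/2}(C/b))$, which is how we read Lemma C.2; the constant in front of $\sqrt{\pi}$ is an assumption of this sketch. The snippet below approximates the integral with a midpoint rule (our choice of quadrature, with hypothetical function names) and compares it with that right-hand side.

```python
import math


def entropy_integral(b, C, steps=100000):
    """Midpoint-rule approximation of the integral of sqrt(log(C/u)) over (0, b].

    Requires C >= b so that log(C/u) is non-negative on the whole range.
    """
    h = b / steps
    return h * sum(math.sqrt(math.log(C / ((i + 0.5) * h))) for i in range(steps))


def lemma_bound(b, C):
    """Assumed right-hand side b * (sqrt(pi) + sqrt(log(C/b)))."""
    return b * (math.sqrt(math.pi) + math.sqrt(math.log(C / b)))


b, C = 0.5, math.e ** math.pi
integral = entropy_integral(b, C)
bound = lemma_bound(b, C)
```

When $\log(C/b) \ge \pi$, the right-hand side is at most $2b\log^{1/2}(C/b)$, which is the shape of the bound used in (23).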

Lemma A.1 requires the likelihood ratio $|p_{xy}/f_{m,k}|$ to be bounded. The next lemma
shows the rate of decay of the tail probability $P(|Y| > \beta)$ as a function of $m$ and $J_k$.

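Lemma A.3. Let $p \in \Pi(\mathcal{W}^{\infty}_{\alpha,K_0})$ and consider densities in $\mathcal{F}_{m,k}$. Then, under
assumption 5, in a set with probability not smaller than $1 - \eta$,

$\sup_{p} \inf_{f \in \mathcal{F}_{m,k}} \left\| \log \frac{p}{f} \right\|_{\infty,\Omega} \le \text{const.}\,(mJ_k)^{-\alpha/s}(c\,a_\infty + b_\infty)$ (24)

where $\eta = \|\mathrm{Var}(Y|X = x)/c^2\|_{\infty,\Omega}$, $c$ is some large constant, possibly depending on
$m$ and $k$, and the constants $a_\infty$ and $b_\infty$ are defined as

$a_\infty = \operatorname{ess\,sup}_{\Omega\times\Theta} |\partial a(h_k(x, \theta))/\partial\theta|$, and

$b_\infty = \operatorname{ess\,sup}_{\Omega\times\Theta} |\partial b(h_k(x, \theta))/\partial\theta|$.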



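Proof. Set $a_\infty = \operatorname{ess\,sup}_{\Omega\times\Theta} |\partial a(h_k(x, \theta))/\partial\theta|$ and $b_\infty = \operatorname{ess\,sup}_{\Omega\times\Theta} |\partial b(h_k(x, \theta))/\partial\theta|$.
For ease of notation, also set $h_{ki}(\cdot) = h_k(\cdot, \theta_i)$. By the convexity of $-\log$,

$\log \frac{p}{f} \le \sum_{i=1}^{m} g_i \log(p/\pi_i)$

$= \sum_{i=1}^{m} g_i \big( y\,(a(h(x)) - a(h_{ki}(x))) + b(h(x)) - b(h_{ki}(x)) \big)$

$\le (|y|\,a_\infty + b_\infty) \left[ \sum_{i=1}^{m} |g_i - I_{Q_i^m}|\,|h(x) - h_{ki}(x)| + \sum_{i=1}^{m} I_{Q_i^m}\,|h(x) - h_{ki}(x)| \right]$.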



Then, by Assumption 5 and proceeding the same way as in the proof of Theorem 3.1,
and taking any value $c$,

$\operatorname{ess\,sup}_{x \in \Omega,\, |y| \le c}\ \sup_{p} \inf_{f \in \mathcal{F}_{m,k}} \log \frac{p}{f} \le \text{const.}\,(mJ_k)^{-\alpha/s}(c\,a_\infty + b_\infty)$.

The result follows by a simple application of Chebyshev's inequality.

□



The bound on (24) by itself is not enough since we need to relate the function 

satisfying (24) with fc. It follows from Lemma C.3 that for any < q < 1 

. p 1 , p 

log < log . 

If we choose ci small enough, we can find a Cp satisfying 



log- 



P 



^ 1 P 
< c„log-— . 



Combining this result with the previous lemma, we have inside $B(\beta)$

$\operatorname{ess\,sup}_{\Omega\times\Theta} |p_{xy}/f^*_{m,k}| \le c_\infty \exp\{\text{const.}\,(mJ_k)^{-\alpha/s}(c\,a_\infty + b_\infty)\}$

where Coo = e'^^/^^"'^'). 

Now we can use Theorem 10.13 in van der Geer [2000] to show the rate of convergence
of the Hellinger distance between the maximum likelihood estimator and the true
density. For the sake of completeness, the theorem is stated below.

Theorem A.1 (Theorem 10.13 in van der Geer [2000], pg. 190). Let $\hat f_{m,k}$ denote the
maximum likelihood estimator of $p_{xy}$ over $\mathcal{F}_{m,k}$. Set

$\bar{\mathcal{F}}^{1/2}_{m,k}(\delta) = \left\{ \sqrt{\frac{f + f^*}{2}} : f \in \mathcal{F}_{m,k},\ d_h\!\left(\frac{f + f^*}{2}, f^*\right) \le \delta \right\}$

for some fixed $f^* \in \mathcal{F}_{m,k}$ and let $\|p_{xy}/f^*\|_{\infty,\Omega\times A} \le c_s^2$. Choose

$\Psi(\delta) \ge \int_{0^+}^{\delta} H_B^{1/2}(u, \bar{\mathcal{F}}^{1/2}_{m,k}(\delta), \|\cdot\|_2)\,du \vee \delta$,

in such a way that $\Psi(\delta)/\delta^2$ is a non-increasing function of $\delta$. Then, for $\sqrt{n}\,\delta_n^2 \ge
\text{const.}\,\Psi(\delta_n)$, we have

$d_h(p_{xy}, \hat f_{m,k}) = O_p(\delta_n + d_h(f^*, p_{xy}))$.



B Proof of the main results 

Proof of Theorem 2.1. The data generating process of $(x, y)$ and the structure of the
model $f_{m,k}$ are enough to satisfy the measurability assumptions, since the model is a weighted
sum of measurable functions. Also, for any fixed $(x_i, y_i)$, each $\pi_j$ is a continuous function of
$\zeta$ $P_{xy}$-almost surely, and the same holds for the $g_j$'s; hence $f_{m,k}(X_i, Y_i; \cdot)$ is a continuous
function of the parameters $P_{xy}$-a.s. The result follows from Theorem 2.12 in White
[1996].

□ 

Proof of Theorem 2.2. There are different approaches to showing consistency of the
estimate; we proceed by verifying the conditions of Theorem 3.5 in White [1996].

The first assumptions, regarding the existence of the estimate, were already shown
to be satisfied in Theorem 2.1. Assumption 3.2, regarding identifiability, is satisfied
by assumptions 3 and 4. It remains to satisfy assumption 3.1, regarding boundedness
and uniform convergence of the log-likelihood function. We can show continuity of
$E \log f_{m,k}(X, Y; \zeta)$ by noting that we can interchange integration with limits and by using a
first-order Taylor expansion:

fm.k{X,Y;C) 



E 



log 



U,k{X,Y;C-e) 



< supE 

< sup E 



1/9 
dC 

, d , d 

I -Q^fm,kiX, Y] C)'-Q^fm,kiX, Y] Q I 



1/2 



which is bounded by Lemma C.1 and by the fact that $\epsilon$ is arbitrary.

To show uniform convergence of the likelihood function, we verify the conditions
of Theorem 2 in Jennrich [1969]. By assumption, $V_m \times \Theta_{m,k}$ is a compact subset of
$\mathbb{R}^{v_m \times mJ_k}$. Measurability and continuity conditions are already satisfied, so it remains
to show that $\log f_{m,k}$ is bounded by an integrable function. Note that we can bound the
log-likelihood function by:



^^^frnAX^Y^]^ = |logf;^7,(X;z/)(vr(/i,(X;^,),2/) -c(2/))| 

i=l 



i 

< max ess sup \a{hk{x; 9i)\\Y\ + \b{hk{x;9i))\ 



l<i<m 



Define the bounding function $D(X, Y) = \max_{1 \le i \le m} \operatorname{ess\,sup}_{\theta_i} [|a(h_k(X; \theta_i))| \times |Y| +
|b(h_k(X; \theta_i))|]$. The function $|h_k(x; \theta_i)| \le \sum_i |\theta_i| < \infty$ because $|x| \le 1$ and
$\sum_i |\theta_i| < \infty$; hence both $a(h_k)$ and $b(h_k)$ are finite. Thus, it is straightforward to show
that $E D(X, Y) < \infty$, given that $E_{y|x}(Y) < \infty$, which is satisfied by the assumption about
$p(y|x)$.

Then, $\log f_{m,k}(X, Y; \zeta) \to_{a.s.} \log f_{m,k}(X, Y; \zeta^*)$ as $n \to \infty$. Therefore, by Theorem
3.5 in White [1996], $\hat\zeta_n \to \zeta^*$ $P_{XY}$-a.s. as $n \to \infty$.

□ 

Proof of Theorem 3.1. To bound the approximation rate of the Kullback-Leibler divergence
it is enough to bound the upper divergence $V(f_{m,k}, p)$:

$V(f_{m,k}, p) = \int_{\Omega} \sum_{j=1}^{m} g_j(x; \nu)\,(h_k(x; \theta_j) - h(x))^2\,dP_x$. (25)



Assumption 5 ensures the existence of $\nu_{c_1}$ such that $\max_j \|g_j(\cdot; \nu_{c_1}) - I_{Q_j^m}(\cdot)\|_{1,P_x} \le
(c_1/r_m)\,\|dP_x/d\lambda\|_{\infty,\Omega}$, where $\|dP_x/d\lambda\|_{\infty,\Omega}$ is finite because $P_x$ has a continuous density
function with respect to the finite measure $\lambda$ on $\Omega$. Consider



(^i) 

+ |lE%(-){^'^(-;^^)-^(-)>'lli,p.- (26) 

V ' 

Now we just have to find bounds for both terms on the right-hand side of (26), namely $(A_1)$
and $(A_2)$. The second term can be written as



Y^iQ^{.){h,{.-e,)-h{.)YdP^ 
= / |E^Qr(-)[^fc(-;^.)-M-)]| dp^, 

where the equality follows from the fact that $I_{Q_i^m} I_{Q_j^m} = I_{Q_i^m} I_{\{i=j\}}$, and $\sum_j I_{Q_j^m}(\cdot) = 1$.
If $k < \alpha$, one can choose $\theta_j$ such that $\sup_{x \in Q_j^m} |h_k(x, \theta_j) - h(x)| \le [K_0/(k +
1)!]\,\mathrm{diam}(Q_j^m)^{k+1}$, where $\mathbf{k} = (k_1, \ldots, k_s)$ is an integer vector satisfying $|\mathbf{k}| = k$. This
claim follows from a Taylor expansion of $h(x)$ around fixed points $x_j \in Q_j^m$ and the
fact that $h \in \mathcal{W}^{\infty}_{\alpha,K_0}$. Similarly, if $k \ge \alpha$ we can only use the expansion up to $\alpha$ terms.
By assumption 5, $\sup_j \mathrm{diam}(Q_j^m) \le 1/r_m^{1/s}$. Then

$\sup_j \sup_{x \in Q_j^m} |h_k(x; \theta_j) - h(x)| \le \frac{c_0}{r_m^{[\alpha \wedge (k+1)]/s}}$ (27)

where $c_0$ depends only on $K_0$ and $\min(k + 1, \alpha)$.

Therefore, $(A_2) \le c_0^2 / r_m^{2[\alpha \wedge (k+1)]/s}$. Note that

m 

< sup sup \hk{x;ej) - V u,) - /q-(-)||i,p. 



2[aA(fe+l)]/s' 



where the last inequality is due to equation (27) and assumption 5.

Combining the results for $(A_1)$ and $(A_2)$,

It follows from Lemma 3.1 that

By assumption, {r^} is sub-geometric, then there exists such that r„i < m < 

rm+i, and > l/m^"/^ > l/r^°/j. Then, l/(r^Jfc)2"/^ > l/(mJfc)2"/^ > 

l/(r^+i By definition of C J",; c Hence, 

inf^ KL{p,fra,k)< inf^ KL{p,fm,k) 

^ M2C2 



2[QA(fc+l)]/s 
' m 



2[«A(fc+l)]/s 
' m+1 

< 

— ^2[QA(fc+l)]/s ' 



where C2 = M^Co(ci + 1) and C3 = M2C2 does not depend on /. Therefore, 



C3 



sup inf KLip, fmk) < — „r , -iM / 



□ 



Proof of Theorem 3.2. The first step is to use equation (20) to bound the convergence
rate inside $B(\beta)$ and bound the tail probability. Choose $\beta = c = (mJ_k)^{\alpha/s}$, and
take $c_s^2 = c_\infty e^{\text{const.}[a_\infty + b_\infty]}$. It follows from Lemma A.3 and the discussion afterwards
that inside $B(\beta)$

$\|p_{xy}/f^*_{m,k}\|_{\infty,\Omega} \le c_\infty \exp\{\text{const.}\,[a_\infty + (mJ_k)^{-\alpha/s} b_\infty]\}$
$\le c_\infty \exp\{\text{const.}\,[a_\infty + b_\infty]\}$
$= c_s^2$.

This choice of $\beta$ gives us

$\eta = P(|Y| > \beta) = \operatorname{ess\,sup}_x \mathrm{Var}(Y|X = x)\,(mJ_k)^{-2\alpha/s}$ (30)

by Lemma A.3, and

$\operatorname{ess\,sup}_x \mathrm{Var}(Y|X = x) = \operatorname{ess\,sup}_x \frac{-[\dot a(h(x))\,\ddot b(h(x)) - \dot b(h(x))\,\ddot a(h(x))]}{(\dot a(h(x)))^3}$,

which is bounded by definition.

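Now, inside $B(\beta)$, we can apply Theorem A.1 in the appendix, setting $f^* = f^*_{m,k}$.
We use Corollary C.1 to bound the bracketing number of $\bar{\mathcal{F}}^{1/2}_{m,k}(\delta)$. Lemma A.2 gives

$\int_0^{\delta} H_B^{1/2}(u, \bar{\mathcal{F}}^{1/2}_{m,k}(\delta), \|\cdot\|_2)\,du \le \text{const.}\,(mJ_k + v_m)^{1/2}\,\delta \log^{1/2} \frac{C}{\delta}$. (31)

Since $C \propto \sqrt{mJ_k + v_m}$, we can choose $\Psi(\delta) \propto (mJ_k + v_m)^{1/2}\,\delta \log^{1/2} \frac{(mJ_k + v_m)^{1/2}}{\delta}$. This
choice of function satisfies the requirement that $\Psi(\delta)/\delta^2$ is non-increasing, and we can take
$\delta_n = (mJ_k + v_m)^{1/2} \sqrt{\log n / n}$. In fact, this choice of $\delta_n$ gives us

$\sqrt{n}\,\delta_n^2 \ge \text{const.}\,\Psi(\delta_n)$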



, ^ ^ /m Jfc + Vm , 1/2 ('TiJ/c + ^'m)^/^ 

d)„ > const. \ log ' 

\ n dn 

ImJk + Vm ( 1,1/2 1, logn\ 
= const. \ —= log ' n log — — 

const. const. mJk + Vm , logn 
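Hence the convergence rate in Hellinger distance is given by:

$d_h(\hat f_{m,k}, p_{xy}) = O_p\!\left( d_h(f^*_{m,k}, p_{xy}) + (mJ_k + v_m)^{1/2} \sqrt{\frac{\log n}{n}} \right)$.

Our choice of $\beta$ allows us to apply Lemma A.1 to obtain

$KL(p_{xy}, \hat f_{m,k}) = O_p\!\left( KL(p_{xy}, f^*_{m,k}) + (mJ_k + v_m) \frac{\log n}{n} \right)$.

We use Theorem 3.1 to conclude that, inside $B(\beta)$,

$KL(p_{xy}, \hat f_{m,k}) = O_p\!\left( \frac{1}{(mJ_k)^{2\alpha/s}} + (mJ_k + v_m) \frac{\log n}{n} \right)$. (32)

Combining this rate inside $B(\beta)$ with (30), we arrive at our result (14).
We achieve the best rate (15) by taking $m \propto (\log n / n)^{-s/(2\alpha+s)}$ and substituting this
rate in (14).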

□ 

Proof of Theorem 3.3. The proof parallels that of Theorem 3.2, with just some small changes to
Lemma A.2. More precisely, since we have a unique $f^*_{m,k}$ for each $(m, k)$, we can
find the bracketing number inside $\bar{\mathcal{F}}^{1/2}_{m,k}(\delta)$. The argument is the same, the only
difference being that inside $\bar{\mathcal{F}}^{1/2}_{m,k}(\delta)$, $\mathrm{diam}(V_m \times \Theta_{m,k}) = \text{const.}\,\delta$, removing the log term on
the right-hand side of (31).

This change allows us to choose $\delta_n = \sqrt{(mJ_k + v_m)/n}$, removing the $\log n$ term from
the rate.

□ 

Proof of Proposition 4.1. I. Under the constraint $m\xi^{s} = C$, $U = (C/\xi^{s})^{-2(\alpha \wedge \xi)/s} + C/n$.
Consider two cases. When $\xi > \alpha$, $U$ obviously increases with $\xi$, so the optimal $\xi_0 \le \alpha$.
When $\xi \le \alpha$, computing the derivative shows that the function $U$ is minimized
at $\xi = C^{1/s}/e$. When this point is to the right of $\alpha$, the function $U$ decreases for
all $\xi \le \alpha$, so we obtain $\xi_0 = \alpha$. When $C^{1/s}/e$ is to the left of $\alpha$, the minimum is
achieved at $C^{1/s}/e$ and $\xi_0 = C^{1/s}/e$. Combining these we obtain the minimizer
$\xi_0 = \alpha \wedge (C^{1/s}/e)$.

II. and III. These follow directly from $(\dagger)$.

□ 
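The derivative computation used in part I can be written out as follows. This is our own expansion of the step, for the case $\xi \le \alpha$ with $m = C/\xi^{s}$ substituted into $U$; the second term $C/n$ is constant under the constraint and does not affect the minimizer.

```latex
% Case \xi \le \alpha, with m = C/\xi^{s} substituted into U.
\begin{align*}
U(\xi) &= \left(\frac{C}{\xi^{s}}\right)^{-2\xi/s} + \frac{C}{n}
        = \exp\!\left\{2\xi\left(\ln \xi - \tfrac{1}{s}\ln C\right)\right\} + \frac{C}{n}, \\
\frac{d}{d\xi}\left[2\xi\left(\ln \xi - \tfrac{1}{s}\ln C\right)\right]
       &= 2\left(\ln \xi - \tfrac{1}{s}\ln C\right) + 2 = 0
          \iff \xi = C^{1/s}/e, \\
\frac{d^{2}}{d\xi^{2}}\left[2\xi\left(\ln \xi - \tfrac{1}{s}\ln C\right)\right]
       &= \frac{2}{\xi} > 0 .
\end{align*}
```

Since the exponent is strictly convex in $\xi$, the stationary point $C^{1/s}/e$ is its unique minimizer, and $U$ is decreasing to the left of it, which is what the case distinction in the proof uses.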

C Auxiliary Results 

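In the next lemma, we use the notation $\partial_\theta = \partial/\partial\theta$, $\partial_{\theta\theta'} = \partial^2/\partial\theta\,\partial\theta'$, $a_j = a(h_k(x; \theta_j))$,
$\dot a_j = \partial_\theta a_j$, $\ddot a_j = \partial_{\theta\theta'} a_j$ and so on.

Lemma C.1. Let $f \in \mathcal{F}_{m,k}$. Under assumption 1,

• $E|\log f| < \infty$

• $E|\nabla \log f| < \infty$

• $E|\nabla \log f|^2 < \infty$

• if we further assume 3 and 4, then $E|\nabla^2 \log f| < \infty$ and is nonsingular at $\zeta^*$.

Proof. This lemma is proved by calculating the derivatives and bounding them.

First note that $a_j$ and $b_j$ are continuously differentiable functions of $h_k(x; \theta_j)$. Since
$|h_k(x; \theta_j)| \le |\theta_j|_1 \le \sqrt{J_k}\,|\theta_j|_2 < \infty$ for any fixed $k$, both $a_j$ and $b_j$ are also
bounded. The same reasoning can be applied to $\dot a_j$ and $\dot b_j$. Also, by definition,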

$E|Y|^p < \infty$ for any $p > 0$. Then

$E \log f \le E[\log \sum_j g_j \pi_j] \le E|\max_j [y a_j + b_j + c(y)]| < \infty$.

Let $\delta_j = g_j \pi_j / f \le 1$ and $c^* = \max_j \|\partial_\nu \log g_j\|_{\infty,\Omega}$; then

$E|\partial_{\theta_j} \log f| = E|\delta_j (y \dot a_j + \dot b_j) x| < \infty$,
$E|\partial_\nu \log f| = E|\sum_j \delta_j\, \partial_\nu \log g_j| \le m c^* < \infty$.

The same follows for Ej^e log E|9^ log/|i and E|9e log log /|. Let c* 
\\d,y\oggj\\oo,n, and choose any vector a with appropriate dimensions satisfying a' a 
1 then 



Ka'\dg^e', log/|a = Ea|5j(l — 6j){yaj + bjYxx' + Sj^y'dj + bj)xx'\a 
< 0.25E|ydj + bjl'^ + Emax \ydj + bj\ < oo, 



Ea'\dg^o'^ log f\a = Ea'\ - 6j6k{ydk + bk){ydj + bj)xx'\a 
< E|(ydfc + bk){ydj + bj)\ < oo. 



Ea'ia ,, log/|« = E« 



•x , 

a 



f 

< Ea'\Sj{ydj + bj)xlv^'\ac* 

< E\yaj + bj\c* < oo, 

Ea \d^^i\ogf\a = Ea \ — j j '-^^ 

< c*\c* \ + c*^ < oo. 

Since $\zeta^*$ is a maximizer of $E \log f$ over $\mathcal{F}_{m,k}$, $E|\nabla^2 \log f|$ has to be non-negative
definite. Assumption 4 tells us it is also invertible; therefore $E|\nabla^2 \log f|$ is positive
definite.

□ 

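Lemma C.2. For any $0 < a \le b \le 1$ and a positive constant $C$,

$\int_a^b \log^{1/2} \frac{C}{u}\,du \le b\left(\sqrt{\pi} + \log^{1/2} \frac{C}{b}\right)$. (33)

Proof. For any $0 < a \le b \le 1$,

$\int_a^b \log^{1/2}(C/u)\,du \le \int_0^b \log^{1/2}(C/u)\,du$
$= C \int_{\log(C/b)}^{\infty} t^{1/2} e^{-t}\,dt$
$= C\,\Gamma(3/2, \log(C/b))$
$\le b\left(\sqrt{\pi} + \log^{1/2}(C/b)\right)$,

where the second line uses the substitution $t = \log(C/u)$ and the last inequality follows from

$\Gamma(3/2, x) = \tfrac{1}{2}\Gamma(1/2, x) + x^{1/2} e^{-x}$.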



Lemma C.3. Let p and q denote two positive densities. For any < q < 1, 



□ 



log ^ < ^ log ——^ . (34) 

q 1 - Q + (1 - ci)q 



Proof. By the convexity of the logarithm we have 

P 

-T-s r = ~ ^°s(q + (1 - ci)q/p) 

cip + (1 - ci)q 

> Q(-logl) + (1 - Q)(-logg/p) 

= (l-Q)log^ 
Q 



□ 



Lemma C.4 (Lemma 4.2 in van der Geer [2000]). We have, for fi, f2 and some /*, 
that 

V24 (^^^^, ^^^1 < /2). (35) 



This lemma is similar to Lemma 4.2 in van der Geer [2000], the only difference
being that we consider an arbitrary $f^*$, whereas van der Geer [2000] takes $f^*$ to be the
true density. The proof remains unchanged.

Corollary C.l. Let J^Z li^) be as in theorem A.l, and 

We have that NBie,^'Jl{V26), \\ ■ h) < NBis, J^'J^iS), \\ ■ h). 

Proof. The proof follows from lemma C.4, taking fi = f and /2 = /* = /*• We have 
that an (-\/2£:) -bracket net for ^F^^iV^d) is also an e-bracket net for J'l^1{6), all with 
respect to || -112. □ 

The next lemma provides a bound on the bracketing number of functional classes 
that are Lipschitz in a parameter. 

Lemma C.5 (Theorem 2.7.11 in van der Vaart and Wellner [1996]). Let $\mathcal{F} = \{f_t : t \in
T\}$ be a class of functions satisfying

$|f_s(x) - f_t(x)| \le d(s, t)\,F(x)$,

for some metric $d$ on $T$, function $F$ on the sample space and every $x$. Then for any
norm $\|\cdot\|$,

$N_B(2\varepsilon\|F\|, \mathcal{F}, \|\cdot\|) \le N(\varepsilon, T, d)$, (36)

where $N(\varepsilon, T, d)$ is the $\varepsilon$-covering number of $T$ with respect to the metric $d$.

It is straightforward to see that if we set $\dim(T) = d$, $\sigma = \mathrm{diam}(T)$, and $C =
2\sigma\|F\|_2$,

$N_B(\varepsilon, \mathcal{F}, \|\cdot\|_2) \le$